{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tidy Up Web-Scraped Media Franchise Data\n",
    "\n",
    "## Background\n",
    "This example combines functionalities of [pyjanitor](https://anaconda.org/conda-forge/pyjanitor) and [pandas-flavor](https://anaconda.org/conda-forge/pandas-flavor) to showcase an explicit--and thus reproducible--workflow enabled by dataframe __method chaining__.\n",
    "\n",
    "The data cleaning workflow largely follows the [R example](https://github.com/rfordatascience/tidytuesday/blob/master/data/2019/2019-07-02/revenue.R) from [the tidytuesday project](https://github.com/rfordatascience/tidytuesday). The raw data is scraped from [Wikipedia page](https://en.wikipedia.org/wiki/List_of_highest-grossing_media_franchises) titled \"*List of highest-grossing media franchises*\". The workflow is presented both in multi-step (section1) and in one-shot (section 2) fashions.\n",
    "\n",
    "More specifically, you will find several data-cleaning techniques that one may encounter frequently in web-scraping tasks; This includes:\n",
    "\n",
    "* String operations with regular expressions (with `pandas-favor`)\n",
    "* Data type changes (with `pyjanitor`)\n",
    "* Split strings in cells into separate rows (with `pandas-flavor`)\n",
    "* Split strings in cells into separate columns (with `pyjanitor` + `pandas-flavor`)\n",
    "* Filter dataframe values based on substring pattern (with `pyjanitor`)\n",
    "* Column value remapping with fuzzy substring matching (with `pyjanitor` + `pandas-flavor`)\n",
    "\n",
    "Data visualization is not included in this example. But if you are looking for inspirations, [here](https://www.reddit.com/r/dataisbeautiful/comments/c53540/highest_grossing_media_franchises_oc/) is a good example. \n",
    "\n",
    "---\n",
    "\n",
    "## Structural convention\n",
    "### 1. Annotation system in code comments\n",
    "This example includes both `pyjanitor` and `pandas-flavors` methods. As you step through this example, you will see the following annotation system in code comments that explains various methods' categories:\n",
    "\n",
    "* `[pyjanitor]` denotes `pyjanitor` methods\n",
    "* `[pandas-flavor]` denotes custom `pandas-flavor` methods\n",
    "* `[pyjanitor + pandas-flavor]` denotes `pandas-flavor` methods built on top of `pyjanitor` functions\n",
    "\n",
    "### 2. R counterpart snippets and python code in tandem\n",
    "The multi-step workflow is presented by alternating the original R snippets (from tidytuesday) and the corresponding python implementations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Python implementation\n",
    "\n",
    "### Preparation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import pyjanitor and pandas\n",
    "from typing import List\n",
    "\n",
    "import pandas as pd\n",
    "import pandas_flavor as pf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Suppress user warnings\n",
    "# when we try overwriting our custom pandas flavor functions\n",
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### Section 1 Multi-Step\n",
    "#### Load data\n",
    "\n",
    "##### Note: The table from the url has been saved as a csv file for use in this example notebook.\n",
    "\n",
    "R snippet:\n",
    "```R\n",
    "url <- \"https://en.wikipedia.org/wiki/List_of_highest-grossing_media_franchises\"\n",
    "df <- url %>% \n",
    "  read_html() %>% \n",
    "  html_table(fill = TRUE) %>% \n",
    "  .[[2]]\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Franchise</th>\n",
       "      <th>Year of inception</th>\n",
       "      <th>Total revenue (USD)</th>\n",
       "      <th>Revenue breakdown (est.)</th>\n",
       "      <th>Original media</th>\n",
       "      <th>Creator(s)</th>\n",
       "      <th>Owner(s)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Middle-earth (The Lord of the Rings)</td>\n",
       "      <td>1937</td>\n",
       "      <td>$19.9 billion[da]</td>\n",
       "      <td>Book sales – $9.125 billion[245] Box office – ...</td>\n",
       "      <td>Novel</td>\n",
       "      <td>J. R. R. Tolkien</td>\n",
       "      <td>Tolkien Estate (books) Warner Bros. (AT&amp;T) (fi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>James Bond</td>\n",
       "      <td>1953</td>\n",
       "      <td>est. $19.9 billion[db]</td>\n",
       "      <td>Box office – $7.078 billion[249] Home video sa...</td>\n",
       "      <td>Novel</td>\n",
       "      <td>Ian Fleming</td>\n",
       "      <td>Metro-Goldwyn-Mayer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Peanuts</td>\n",
       "      <td>1950</td>\n",
       "      <td>est. $19.1 billion</td>\n",
       "      <td>Retail sales – $18.805 billion[dc] Box office ...</td>\n",
       "      <td>Comic strip</td>\n",
       "      <td>Charles M. Schulz</td>\n",
       "      <td>Sony Music Entertainment Japan (Sony) WildBrai...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              Franchise  Year of inception  \\\n",
       "0  Middle-earth (The Lord of the Rings)               1937   \n",
       "1                            James Bond               1953   \n",
       "2                               Peanuts               1950   \n",
       "\n",
       "      Total revenue (USD)                           Revenue breakdown (est.)  \\\n",
       "0       $19.9 billion[da]  Book sales – $9.125 billion[245] Box office – ...   \n",
       "1  est. $19.9 billion[db]  Box office – $7.078 billion[249] Home video sa...   \n",
       "2      est. $19.1 billion  Retail sales – $18.805 billion[dc] Box office ...   \n",
       "\n",
       "  Original media         Creator(s)  \\\n",
       "0          Novel   J. R. R. Tolkien   \n",
       "1          Novel        Ian Fleming   \n",
       "2    Comic strip  Charles M. Schulz   \n",
       "\n",
       "                                            Owner(s)  \n",
       "0  Tolkien Estate (books) Warner Bros. (AT&T) (fi...  \n",
       "1                                Metro-Goldwyn-Mayer  \n",
       "2  Sony Music Entertainment Japan (Sony) WildBrai...  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# originally from\n",
    "# https://en.wikipedia.org/wiki/List_of_highest-grossing_media_franchises\n",
    "fileurl = \"../data/medium_franchise_raw_table.csv\"\n",
    "df_raw = pd.read_csv(fileurl)\n",
    "df_raw.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Rename columns\n",
    "R snippet:\n",
    "```R\n",
    "clean_money <- df %>% \n",
    "  set_names(nm = c(\"franchise\", \"year_created\", \"total_revenue\", \"revenue_items\",\n",
    "                   \"original_media\", \"creators\", \"owners\"))\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pandas-flavor helper functions\n",
    "\n",
    "\n",
    "# [pandas-flavor]\n",
    "@pf.register_dataframe_method\n",
    "def str_remove(df, column_name: str, pattern: str = \"\"):\n",
    "    \"\"\"Remove string pattern from a column\n",
    "\n",
    "    Wrapper around df.str.replace()\n",
    "\n",
    "    Parameters\n",
    "    -----------\n",
    "    df: pd.Dataframe\n",
    "        Input dataframe to be modified\n",
    "    column_name: str\n",
    "        Name of the column to be operated on\n",
    "    pattern: str, default to ''\n",
    "        String pattern to be removed\n",
    "\n",
    "    Returns\n",
    "    --------\n",
    "    df: pd.Dataframe\n",
    "\n",
    "    \"\"\"\n",
    "    df[column_name] = df[column_name].str.replace(pattern, \"\")\n",
    "    return df\n",
    "\n",
    "\n",
    "# [pandas-flavor]\n",
    "@pf.register_dataframe_method\n",
    "def str_trim(df, column_name: str):\n",
    "    \"\"\"Remove leading and trailing white space from a column of strings\n",
    "\n",
    "    Wrapper around df.str.strip()\n",
    "\n",
    "    Parameters\n",
    "    -----------\n",
    "    df: pd.Dataframe\n",
    "        Input dataframe to be modified\n",
    "    column_name: str\n",
    "        Name of the column to be operated on\n",
    "\n",
    "    Returns\n",
    "    --------\n",
    "    df: pd.Dataframe\n",
    "\n",
    "    \"\"\"\n",
    "    df[column_name] = df[column_name].str.strip()\n",
    "    return df\n",
    "\n",
    "\n",
    "# [pandas-flavor]\n",
    "@pf.register_dataframe_method\n",
    "def str_slice(df, column_name: str, start: int = 0, stop: int = -1):\n",
    "    \"\"\"Slice a column of strings by indexes\n",
    "\n",
    "    Parameters\n",
    "    -----------\n",
    "    df: pd.Dataframe\n",
    "        Input dataframe to be modified\n",
    "    column_name: str\n",
    "        Name of the column to be operated on\n",
    "    start: int, optional, default to 0\n",
    "        Staring index for string slicing\n",
    "    stop: int, optional, default to -1\n",
    "        Ending index for string slicing\n",
    "\n",
    "    Returns\n",
    "    --------\n",
    "    df: pd.Dataframe\n",
    "\n",
    "    \"\"\"\n",
    "    df[column_name] = df[column_name].str[start:stop]\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>franchise</th>\n",
       "      <th>year_created</th>\n",
       "      <th>total_revenue</th>\n",
       "      <th>revenue_items</th>\n",
       "      <th>original_media</th>\n",
       "      <th>creators</th>\n",
       "      <th>owners</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Middle-earth (The Lord of the Rings)</td>\n",
       "      <td>1937</td>\n",
       "      <td>$19.9 billion[da]</td>\n",
       "      <td>Book sales – $9.125 billion[245] Box office – ...</td>\n",
       "      <td>Novel</td>\n",
       "      <td>J. R. R. Tolkien</td>\n",
       "      <td>Tolkien Estate (books) Warner Bros. (AT&amp;T) (fi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>James Bond</td>\n",
       "      <td>1953</td>\n",
       "      <td>est. $19.9 billion[db]</td>\n",
       "      <td>Box office – $7.078 billion[249] Home video sa...</td>\n",
       "      <td>Novel</td>\n",
       "      <td>Ian Fleming</td>\n",
       "      <td>Metro-Goldwyn-Mayer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Peanuts</td>\n",
       "      <td>1950</td>\n",
       "      <td>est. $19.1 billion</td>\n",
       "      <td>Retail sales – $18.805 billion[dc] Box office ...</td>\n",
       "      <td>Comic strip</td>\n",
       "      <td>Charles M. Schulz</td>\n",
       "      <td>Sony Music Entertainment Japan (Sony) WildBrai...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              franchise  year_created           total_revenue  \\\n",
       "0  Middle-earth (The Lord of the Rings)          1937       $19.9 billion[da]   \n",
       "1                            James Bond          1953  est. $19.9 billion[db]   \n",
       "2                               Peanuts          1950      est. $19.1 billion   \n",
       "\n",
       "                                       revenue_items original_media  \\\n",
       "0  Book sales – $9.125 billion[245] Box office – ...          Novel   \n",
       "1  Box office – $7.078 billion[249] Home video sa...          Novel   \n",
       "2  Retail sales – $18.805 billion[dc] Box office ...    Comic strip   \n",
       "\n",
       "            creators                                             owners  \n",
       "0   J. R. R. Tolkien  Tolkien Estate (books) Warner Bros. (AT&T) (fi...  \n",
       "1        Ian Fleming                                Metro-Goldwyn-Mayer  \n",
       "2  Charles M. Schulz  Sony Music Entertainment Japan (Sony) WildBrai...  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "colnames = (\n",
    "    \"franchise\",\n",
    "    \"year_created\",\n",
    "    \"total_revenue\",\n",
    "    \"revenue_items\",\n",
    "    \"original_media\",\n",
    "    \"creators\",\n",
    "    \"owners\",\n",
    ")\n",
    "df_dirty = df_raw.rename(\n",
    "    columns={col_old: col_new for col_old, col_new in zip(df_raw.columns, colnames)}\n",
    ")\n",
    "df_dirty.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Clean up `total_revenue` column\n",
    "R snippet:\n",
    "```R\n",
    "clean_money <- df %>% ... %>%\n",
    "mutate(total_revenue = str_remove(total_revenue, \"est.\"),\n",
    "     total_revenue = str_trim(total_revenue),\n",
    "     total_revenue = str_remove(total_revenue, \"[$]\"),\n",
    "     total_revenue = word(total_revenue, 1, 1),\n",
    "     total_revenue = as.double(total_revenue))\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>franchise</th>\n",
       "      <th>year_created</th>\n",
       "      <th>total_revenue</th>\n",
       "      <th>revenue_items</th>\n",
       "      <th>original_media</th>\n",
       "      <th>creators</th>\n",
       "      <th>owners</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Middle-earth (The Lord of the Rings)</td>\n",
       "      <td>1937</td>\n",
       "      <td>19.0</td>\n",
       "      <td>Book sales – $9.125 billion[245] Box office – ...</td>\n",
       "      <td>Novel</td>\n",
       "      <td>J. R. R. Tolkien</td>\n",
       "      <td>Tolkien Estate (books) Warner Bros. (AT&amp;T) (fi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>James Bond</td>\n",
       "      <td>1953</td>\n",
       "      <td>19.0</td>\n",
       "      <td>Box office – $7.078 billion[249] Home video sa...</td>\n",
       "      <td>Novel</td>\n",
       "      <td>Ian Fleming</td>\n",
       "      <td>Metro-Goldwyn-Mayer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Peanuts</td>\n",
       "      <td>1950</td>\n",
       "      <td>19.0</td>\n",
       "      <td>Retail sales – $18.805 billion[dc] Box office ...</td>\n",
       "      <td>Comic strip</td>\n",
       "      <td>Charles M. Schulz</td>\n",
       "      <td>Sony Music Entertainment Japan (Sony) WildBrai...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              franchise  year_created  total_revenue  \\\n",
       "0  Middle-earth (The Lord of the Rings)          1937           19.0   \n",
       "1                            James Bond          1953           19.0   \n",
       "2                               Peanuts          1950           19.0   \n",
       "\n",
       "                                       revenue_items original_media  \\\n",
       "0  Book sales – $9.125 billion[245] Box office – ...          Novel   \n",
       "1  Box office – $7.078 billion[249] Home video sa...          Novel   \n",
       "2  Retail sales – $18.805 billion[dc] Box office ...    Comic strip   \n",
       "\n",
       "            creators                                             owners  \n",
       "0   J. R. R. Tolkien  Tolkien Estate (books) Warner Bros. (AT&T) (fi...  \n",
       "1        Ian Fleming                                Metro-Goldwyn-Mayer  \n",
       "2  Charles M. Schulz  Sony Music Entertainment Japan (Sony) WildBrai...  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "column_name = \"total_revenue\"\n",
    "df_clean_money = (\n",
    "    df_dirty.str_remove(column_name, pattern=\"est.\")  # [pandas-flavor]\n",
    "    .str_trim(column_name)  # [pandas-flavor]\n",
    "    .str_remove(column_name, pattern=\"\\$\")  # [pandas-flavor]\n",
    "    .str_slice(column_name, start=0, stop=2)  # [pandas-flavor]\n",
    "    .change_type(column_name, float)  # [pyjanitor]\n",
    ")\n",
    "df_clean_money.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Split column `revenue_items` into `revenue_category` and `revenue`\n",
    "R snippet:\n",
    "```R\n",
    "clean_category <- clean_money %>% \n",
    "    separate_rows(revenue_items, sep = \"\\\\[\") %>% \n",
    "    filter(str_detect(revenue_items, \"illion\")) %>% \n",
    "    separate(revenue_items, into = c(\"revenue_category\", \"revenue\"), sep = \"[$]\") %>% \n",
    "    mutate(revenue_category = str_remove(revenue_category, \" – \"),\n",
    "         revenue_category = str_remove(revenue_category, regex(\".*\\\\]\")),\n",
    "         revenue_category = str_remove(revenue_category, \"\\n\")) \n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pandas-flavor helper functions\n",
    "\n",
    "\n",
    "# [pandas-flavor]\n",
    "@pf.register_dataframe_method\n",
    "def separate_rows(df, column_name: str, sep: str = \"\"):\n",
    "    \"\"\"Split each cell of a column that contains a list of items\n",
    "    (separated by `sep`) into separate rows\n",
    "\n",
    "    Parameters\n",
    "    -----------\n",
    "    df: pd.Dataframe\n",
    "        Input dataframe to be modified\n",
    "    column_name: str\n",
    "        Name of the column to be operated on\n",
    "    sep: str, default to ''\n",
    "        Substring used as separators for cell splitting\n",
    "\n",
    "    Returns\n",
    "    --------\n",
    "    df: pd.Dataframe\n",
    "\n",
    "    \"\"\"\n",
    "    # Preserve an id field for later merge\n",
    "    columns_original = list(df.columns)\n",
    "    df[\"id\"] = df.index\n",
    "    wdf = pd.DataFrame(df[column_name].str.split(sep).tolist()).stack().reset_index()\n",
    "    # Preserve the same id field for merge\n",
    "    wdf.rename(columns={\"level_0\": \"id\", 0: \"revenue_items\"}, inplace=True)\n",
    "    wdf.drop(columns=[\"level_1\"], inplace=True)\n",
    "    # Merge and preserve original order\n",
    "    return pd.merge(df, wdf, on=\"id\", suffixes=(\"_drop\", \"\")).drop(\n",
    "        columns=[\"id\", column_name + \"_drop\"]\n",
    "    )[columns_original]\n",
    "\n",
    "\n",
    "# [pyjanitor + pandas-flavor]\n",
    "@pf.register_dataframe_method\n",
    "def separate(df, column_name: str, into: List[str] = None, sep: str = \"\"):\n",
    "    \"\"\"Split a column into separate columns at separator specified by `sep`\n",
    "\n",
    "    Parameters\n",
    "    -----------\n",
    "    df: pd.Dataframe\n",
    "        Input dataframe to be modified\n",
    "    column_name: str\n",
    "        Name of the column to be operated on\n",
    "    into: List[str], default to None\n",
    "        New column names for the split columns\n",
    "    sep: str, default to ''\n",
    "        Separator at which to split the column\n",
    "\n",
    "    Returns\n",
    "    --------\n",
    "    df: pd.Dataframe\n",
    "\n",
    "    \"\"\"\n",
    "    index_original = list(df.columns).index(column_name)\n",
    "    cols = list(df.columns)\n",
    "    cols.remove(column_name)\n",
    "    for i, col in enumerate(into):\n",
    "        cols.insert(index_original + i, col)\n",
    "    return df.deconcatenate_column(column_name, new_column_names=into, sep=sep).drop(\n",
    "        columns=column_name\n",
    "    )[cols]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>franchise</th>\n",
       "      <th>year_created</th>\n",
       "      <th>total_revenue</th>\n",
       "      <th>revenue_category</th>\n",
       "      <th>revenue</th>\n",
       "      <th>original_media</th>\n",
       "      <th>creators</th>\n",
       "      <th>owners</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Middle-earth (The Lord of the Rings)</td>\n",
       "      <td>1937</td>\n",
       "      <td>19.0</td>\n",
       "      <td>Book sales</td>\n",
       "      <td>9.125 billion</td>\n",
       "      <td>Novel</td>\n",
       "      <td>J. R. R. Tolkien</td>\n",
       "      <td>Tolkien Estate (books) Warner Bros. (AT&amp;T) (fi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Middle-earth (The Lord of the Rings)</td>\n",
       "      <td>1937</td>\n",
       "      <td>19.0</td>\n",
       "      <td>Box office</td>\n",
       "      <td>5.896 billion</td>\n",
       "      <td>Novel</td>\n",
       "      <td>J. R. R. Tolkien</td>\n",
       "      <td>Tolkien Estate (books) Warner Bros. (AT&amp;T) (fi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Middle-earth (The Lord of the Rings)</td>\n",
       "      <td>1937</td>\n",
       "      <td>19.0</td>\n",
       "      <td>Home video sales</td>\n",
       "      <td>4.194 billion</td>\n",
       "      <td>Novel</td>\n",
       "      <td>J. R. R. Tolkien</td>\n",
       "      <td>Tolkien Estate (books) Warner Bros. (AT&amp;T) (fi...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              franchise  year_created  total_revenue  \\\n",
       "0  Middle-earth (The Lord of the Rings)          1937           19.0   \n",
       "1  Middle-earth (The Lord of the Rings)          1937           19.0   \n",
       "2  Middle-earth (The Lord of the Rings)          1937           19.0   \n",
       "\n",
       "    revenue_category        revenue original_media          creators  \\\n",
       "0         Book sales  9.125 billion          Novel  J. R. R. Tolkien   \n",
       "1         Box office  5.896 billion          Novel  J. R. R. Tolkien   \n",
       "2   Home video sales  4.194 billion          Novel  J. R. R. Tolkien   \n",
       "\n",
       "                                              owners  \n",
       "0  Tolkien Estate (books) Warner Bros. (AT&T) (fi...  \n",
       "1  Tolkien Estate (books) Warner Bros. (AT&T) (fi...  \n",
       "2  Tolkien Estate (books) Warner Bros. (AT&T) (fi...  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Generate `df_clean_category` on top of `df_clean_money`\n",
    "column_name = \"revenue_items\"\n",
    "df_clean_category = (\n",
    "    df_clean_money.separate_rows(column_name, sep=\"\\[\")  # [pandas-flavor]\n",
    "    .filter_string(column_name, \"illion\")  # [pyjanitor]\n",
    "    .separate(\n",
    "        column_name, into=[\"revenue_category\", \"revenue\"], sep=\"\\$\"\n",
    "    )  # [pyjanitor + pandas-flavor]\n",
    "    .str_remove(\"revenue_category\", pattern=\" – \")  # [pandas-flavor]\n",
    "    .str_remove(\"revenue_category\", pattern=\".*\\]\")  # [pandas-flavor]\n",
    "    .str_remove(\"revenue_category\", pattern=\"\\n\")  # [pandas-flavor]\n",
    ")\n",
    "df_clean_category.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Clean up `revenue_category` column\n",
    "R snippet:\n",
    "```R\n",
    "clean_df <- clean_category %>% \n",
    "  mutate(revenue_category = case_when(\n",
    "    str_detect(str_to_lower(revenue_category), \"box office\") ~ \"Box Office\",\n",
    "    str_detect(str_to_lower(revenue_category), \"dvd|blu|vhs|home video|video rentals|video sales|streaming|home entertainment\") ~ \"Home Video/Entertainment\",\n",
    "    str_detect(str_to_lower(revenue_category), \"video game|computer game|mobile game|console|game|pachinko|pet|card\") ~ \"Video Games/Games\",\n",
    "    str_detect(str_to_lower(revenue_category), \"comic|manga\") ~ \"Comic or Manga\",\n",
    "    str_detect(str_to_lower(revenue_category), \"music|soundtrack\") ~ \"Music\",\n",
    "    str_detect(str_to_lower(revenue_category), \"tv\") ~ \"TV\",\n",
    "    str_detect(str_to_lower(revenue_category), \"merchandise|licens|mall|stage|retail\") ~ \"Merchandise, Licensing & Retail\",\n",
    "    TRUE ~ revenue_category))\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pandas-flavor helper functions\n",
    "\n",
    "\n",
    "# [pyjanitor + pandas-flavor]\n",
    "@pf.register_dataframe_method\n",
    "def fuzzy_match_replace(df, column_name: str, mapper: dict = None):\n",
    "    \"\"\"Value remapping for specific column with fuzzy matching and replacement\n",
    "    of strings\n",
    "\n",
    "    Parameters\n",
    "    -----------\n",
    "    df: pd.Dataframe\n",
    "        Input dataframe to be modified\n",
    "    column_name: str\n",
    "        Name of the column to be operated on\n",
    "    mapper: dict, default to None\n",
    "        {oldstr_0: newstr_0, oldstr_1: newstr_1, ..., oldstr_n: newstr_n}\n",
    "\n",
    "    Returns\n",
    "    --------\n",
    "    df: pd.Dataframe\n",
    "\n",
    "    \"\"\"\n",
    "    for k, v in mapper.items():\n",
    "        condition = df[column_name].str.contains(k)\n",
    "        # [pyjanitor] update_where: update value when condition is True\n",
    "        df = df.update_where(condition, column_name, v)\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>franchise</th>\n",
       "      <th>year_created</th>\n",
       "      <th>total_revenue</th>\n",
       "      <th>revenue_category</th>\n",
       "      <th>revenue</th>\n",
       "      <th>original_media</th>\n",
       "      <th>creators</th>\n",
       "      <th>owners</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Middle-earth (The Lord of the Rings)</td>\n",
       "      <td>1937</td>\n",
       "      <td>19.0</td>\n",
       "      <td>book sales</td>\n",
       "      <td>9.125 billion</td>\n",
       "      <td>Novel</td>\n",
       "      <td>J. R. R. Tolkien</td>\n",
       "      <td>Tolkien Estate (books) Warner Bros. (AT&amp;T) (fi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Middle-earth (The Lord of the Rings)</td>\n",
       "      <td>1937</td>\n",
       "      <td>19.0</td>\n",
       "      <td>Box Office</td>\n",
       "      <td>5.896 billion</td>\n",
       "      <td>Novel</td>\n",
       "      <td>J. R. R. Tolkien</td>\n",
       "      <td>Tolkien Estate (books) Warner Bros. (AT&amp;T) (fi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Middle-earth (The Lord of the Rings)</td>\n",
       "      <td>1937</td>\n",
       "      <td>19.0</td>\n",
       "      <td>Home Video/Entertainment</td>\n",
       "      <td>4.194 billion</td>\n",
       "      <td>Novel</td>\n",
       "      <td>J. R. R. Tolkien</td>\n",
       "      <td>Tolkien Estate (books) Warner Bros. (AT&amp;T) (fi...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              franchise  year_created  total_revenue  \\\n",
       "0  Middle-earth (The Lord of the Rings)          1937           19.0   \n",
       "1  Middle-earth (The Lord of the Rings)          1937           19.0   \n",
       "2  Middle-earth (The Lord of the Rings)          1937           19.0   \n",
       "\n",
       "           revenue_category        revenue original_media          creators  \\\n",
       "0                book sales  9.125 billion          Novel  J. R. R. Tolkien   \n",
       "1                Box Office  5.896 billion          Novel  J. R. R. Tolkien   \n",
       "2  Home Video/Entertainment  4.194 billion          Novel  J. R. R. Tolkien   \n",
       "\n",
       "                                              owners  \n",
       "0  Tolkien Estate (books) Warner Bros. (AT&T) (fi...  \n",
       "1  Tolkien Estate (books) Warner Bros. (AT&T) (fi...  \n",
       "2  Tolkien Estate (books) Warner Bros. (AT&T) (fi...  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Value mapper `revenue_category`\n",
    "value_mapper = {\n",
    "    \"box office\": \"Box Office\",\n",
    "    \"dvd|blu|vhs|home video|video rentals|video sales|streaming|home entertainment\": \"Home Video/Entertainment\",  # noqa: E501\n",
    "    \"video game|computer game|mobile game|console|game|pachinko|pet|card\": \"Video Games/Games\",  # noqa: E501\n",
    "    \"comic|manga\": \"Comic or Manga\",\n",
    "    \"music|soundtrac\": \"Music\",\n",
    "    \"tv\": \"TV\",\n",
    "    \"merchandise|licens|mall|stage|retail\": \"Merchandise, Licensing & Retail\",\n",
    "}\n",
    "\n",
    "column_name = \"revenue_category\"\n",
    "# [pyjanitor] convert to lower case\n",
    "df_clean_category = (\n",
    "    df_clean_category.transform_column(column_name, str.lower)\n",
    "    # [pyjanitor] strip leading/trailing white space\n",
    "    .transform_column(column_name, str.strip)\n",
    "    # [pyjanitor + pandas_flavor]\n",
    "    .fuzzy_match_replace(column_name, mapper=value_mapper)\n",
    ")\n",
    "df_clean_category.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Clean up `revenue` column\n",
    "R snippet:\n",
    "```R\n",
    "%>% \n",
    "mutate(revenue = str_remove(revenue, \"illion\"),\n",
    "     revenue = str_trim(revenue),\n",
    "     revenue = str_remove(revenue, \" \"),\n",
    "     revenue = case_when(str_detect(revenue, \"m\") ~ paste0(str_extract(revenue, \"[:digit:]+\"), \"e-3\"),\n",
    "                         str_detect(revenue, \"b\") ~ str_extract(revenue, \"[:digit:]+\"),\n",
    "                         TRUE ~ NA_character_),\n",
    "     revenue = format(revenue, scientific = FALSE),\n",
    "     revenue = parse_number(revenue)) %>%\n",
    "mutate(original_media = str_remove(original_media, \"\\\\[.+\"))\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pandas-flavor helper functions\n",
    "\n",
    "\n",
    "# [pandas-flavor]\n",
    "@pf.register_dataframe_method\n",
    "def str_replace(df, column_name: str, old: str = \"\", new: str = \"\"):\n",
    "    \"\"\"Match and replace strings from a dataframe column.\n",
    "    Wrapper around df.str.replace\n",
    "\n",
    "    Parameters\n",
    "    -----------\n",
    "    df: pd.Dataframe\n",
    "        Input dataframe to be modified\n",
    "    column_name: str\n",
    "        Name of the column to be operated on\n",
    "    old: str, default to ''\n",
    "        Old string to be matched and replaced\n",
    "    new: str, default to ''\n",
    "        New string to replace old\n",
    "\n",
    "    Returns\n",
    "    --------\n",
    "    df: pd.Dataframe\n",
    "\n",
    "    \"\"\"\n",
    "    df[column_name] = df[column_name].str.replace(old, new)\n",
    "    return df\n",
    "\n",
    "\n",
    "# [pandas-flavor]\n",
    "@pf.register_dataframe_method\n",
    "def parse_number(df):\n",
    "    \"\"\"Check all columns of dataframe and properly parse numeric types\n",
    "\n",
    "    Parameters\n",
    "    -----------\n",
    "    df: pd.Dataframe\n",
    "        Input dataframe to be modified\n",
    "\n",
    "    Returns\n",
    "    --------\n",
    "    df: pd.Dataframe\n",
    "\n",
    "    \"\"\"\n",
    "    for col in df.columns:\n",
    "        try:\n",
    "            df[col] = pd.to_numeric(df[col])\n",
    "        except ValueError:\n",
    "            continue\n",
    "    return df\n",
    "\n",
    "\n",
    "# [pandas-flavor]\n",
    "@pf.register_dataframe_method\n",
    "def flatten_multiindex(df):\n",
    "    \"\"\"Flatten dataframe with multilevel index\n",
    "    A wrapper around pd.DataFrame(df.to_records())\n",
    "\n",
    "    Parameters\n",
    "    -----------\n",
    "    df: pd.Dataframe\n",
    "        Input dataframe to be modified\n",
    "\n",
    "    Returns\n",
    "    --------\n",
    "    df: pd.Dataframe\n",
    "\n",
    "    \"\"\"\n",
    "    return pd.DataFrame(df.to_records())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>franchise</th>\n",
       "      <th>year_created</th>\n",
       "      <th>total_revenue</th>\n",
       "      <th>revenue_category</th>\n",
       "      <th>revenue</th>\n",
       "      <th>original_media</th>\n",
       "      <th>creators</th>\n",
       "      <th>owners</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Middle-earth (The Lord of the Rings)</td>\n",
       "      <td>1937</td>\n",
       "      <td>19.0</td>\n",
       "      <td>book sales</td>\n",
       "      <td>9.125</td>\n",
       "      <td>Novel</td>\n",
       "      <td>J. R. R. Tolkien</td>\n",
       "      <td>Tolkien Estate (books) Warner Bros. (AT&amp;T) (fi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Middle-earth (The Lord of the Rings)</td>\n",
       "      <td>1937</td>\n",
       "      <td>19.0</td>\n",
       "      <td>Box Office</td>\n",
       "      <td>5.896</td>\n",
       "      <td>Novel</td>\n",
       "      <td>J. R. R. Tolkien</td>\n",
       "      <td>Tolkien Estate (books) Warner Bros. (AT&amp;T) (fi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Middle-earth (The Lord of the Rings)</td>\n",
       "      <td>1937</td>\n",
       "      <td>19.0</td>\n",
       "      <td>Home Video/Entertainment</td>\n",
       "      <td>4.194</td>\n",
       "      <td>Novel</td>\n",
       "      <td>J. R. R. Tolkien</td>\n",
       "      <td>Tolkien Estate (books) Warner Bros. (AT&amp;T) (fi...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              franchise  year_created  total_revenue  \\\n",
       "0  Middle-earth (The Lord of the Rings)          1937           19.0   \n",
       "1  Middle-earth (The Lord of the Rings)          1937           19.0   \n",
       "2  Middle-earth (The Lord of the Rings)          1937           19.0   \n",
       "\n",
       "           revenue_category  revenue original_media          creators  \\\n",
       "0                book sales    9.125          Novel  J. R. R. Tolkien   \n",
       "1                Box Office    5.896          Novel  J. R. R. Tolkien   \n",
       "2  Home Video/Entertainment    4.194          Novel  J. R. R. Tolkien   \n",
       "\n",
       "                                              owners  \n",
       "0  Tolkien Estate (books) Warner Bros. (AT&T) (fi...  \n",
       "1  Tolkien Estate (books) Warner Bros. (AT&T) (fi...  \n",
       "2  Tolkien Estate (books) Warner Bros. (AT&T) (fi...  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# clean up revenue values\n",
    "column_name = \"revenue\"\n",
    "df_clean = (\n",
    "    df_clean_category.str_remove(column_name, \"illion\")  # [pandas-flavor]\n",
    "    .str_trim(column_name)  # [pandas-flavor]\n",
    "    .str_remove(column_name, \" \")  # [pandas-flavor]\n",
    "    .str_replace(column_name, \"\\s*b\", \"\")  # [pandas-flavor]\n",
    "    .str_replace(column_name, \"\\s*m\", \"e-3\")  # [pandas-flavor]\n",
    "    .parse_number()  # [pandas-flavor]\n",
    "    .str_remove(\"original_media\", \"\\[.+\")  # [pandas-flavor]\n",
    ")\n",
    "df_clean.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### Section 2 One-Shot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_clean = (\n",
    "    pd.read_csv(fileurl)\n",
    "    .rename(\n",
    "        columns={col_old: col_new for col_old, col_new in zip(df_raw.columns, colnames)}\n",
    "    )\n",
    "    .str_remove(\"total_revenue\", pattern=\"est.\")  # [pandas-flavor]\n",
    "    .str_trim(\"total_revenue\")  # [pandas-flavor]\n",
    "    .str_remove(\"total_revenue\", pattern=\"\\$\")  # [pandas-flavor]\n",
    "    .str_slice(\"total_revenue\", start=0, stop=2)  # [pandas-flavor]\n",
    "    .change_type(\"total_revenue\", float)  # [pyjanitor]\n",
    "    .separate_rows(\"revenue_items\", sep=\"\\[\")  # [pandas-flavor]\n",
    "    .filter_string(\"revenue_items\", \"illion\")  # [pyjanitor]\n",
    "    # [pyjanitor + pandas-flavor]\n",
    "    .separate(\"revenue_items\", into=[\"revenue_category\", \"revenue\"], sep=\"\\$\")\n",
    "    # [pandas-flavor]\n",
    "    .str_remove(\"revenue_category\", pattern=\" – \")\n",
    "    .str_remove(\"revenue_category\", pattern=\".*\\]\")\n",
    "    .str_remove(\"revenue_category\", pattern=\"\\n\")\n",
    "    # [pyjanitor] convert to lower case\n",
    "    .transform_column(\"revenue_category\", str.lower)\n",
    "    # [pyjanitor] strip leading/trailing white space\n",
    "    .transform_column(\"revenue_category\", str.strip)\n",
    "    # [pyjanitor + pandas_flavor]\n",
    "    .fuzzy_match_replace(\"revenue_category\", mapper=value_mapper)\n",
    "    .str_remove(\"revenue\", \"illion\")  # [pandas-flavor]\n",
    "    .str_trim(\"revenue\")  # [pandas-flavor]\n",
    "    .str_remove(\"revenue\", \" \")  # [pandas-flavor]\n",
    "    .str_replace(\"revenue\", \"\\s*b\", \"\")  # [pandas-flavor]\n",
    "    .str_replace(\"revenue\", \"\\s*m\", \"e-3\")  # [pandas-flavor]\n",
    "    .parse_number()  # [pandas-flavor]\n",
    "    .str_remove(\"original_media\", \"\\[.+\")  # [pandas-flavor]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### Final aggregation and join\n",
    "R snippet:\n",
    "```R\n",
    " sum_df <- clean_df %>%\n",
    "  group_by(franchise, revenue_category) %>% \n",
    "  summarize(revenue = sum(revenue))\n",
    "\n",
    "metadata_df <- clean_df %>% \n",
    "  select(franchise:revenue_category, original_media:owners, -total_revenue)\n",
    "\n",
    "final_df <- left_join(sum_df, metadata_df, \n",
    "                      by = c(\"franchise\", \"revenue_category\")) %>% \n",
    "  distinct(.keep_all = TRUE)\n",
    "\n",
    "final_df\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>franchise</th>\n",
       "      <th>revenue_category</th>\n",
       "      <th>year_created</th>\n",
       "      <th>total_revenue</th>\n",
       "      <th>revenue</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Avengers</td>\n",
       "      <td>Box Office</td>\n",
       "      <td>1963</td>\n",
       "      <td>15.0</td>\n",
       "      <td>7.765</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Avengers</td>\n",
       "      <td>Comic or Manga</td>\n",
       "      <td>1963</td>\n",
       "      <td>15.0</td>\n",
       "      <td>0.131</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Avengers</td>\n",
       "      <td>Home Video/Entertainment</td>\n",
       "      <td>1963</td>\n",
       "      <td>15.0</td>\n",
       "      <td>0.459</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  franchise          revenue_category  year_created  total_revenue  revenue\n",
       "0  Avengers                Box Office          1963           15.0    7.765\n",
       "1  Avengers            Comic or Manga          1963           15.0    0.131\n",
       "2  Avengers  Home Video/Entertainment          1963           15.0    0.459"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_sum = df_clean.groupby([\"franchise\", \"revenue_category\"]).sum().flatten_multiindex()\n",
    "df_sum.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>franchise</th>\n",
       "      <th>revenue_category</th>\n",
       "      <th>original_media</th>\n",
       "      <th>creators</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Middle-earth (The Lord of the Rings)</td>\n",
       "      <td>book sales</td>\n",
       "      <td>Novel</td>\n",
       "      <td>J. R. R. Tolkien</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Middle-earth (The Lord of the Rings)</td>\n",
       "      <td>Box Office</td>\n",
       "      <td>Novel</td>\n",
       "      <td>J. R. R. Tolkien</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Middle-earth (The Lord of the Rings)</td>\n",
       "      <td>Home Video/Entertainment</td>\n",
       "      <td>Novel</td>\n",
       "      <td>J. R. R. Tolkien</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              franchise          revenue_category  \\\n",
       "0  Middle-earth (The Lord of the Rings)                book sales   \n",
       "1  Middle-earth (The Lord of the Rings)                Box Office   \n",
       "2  Middle-earth (The Lord of the Rings)  Home Video/Entertainment   \n",
       "\n",
       "  original_media          creators  \n",
       "0          Novel  J. R. R. Tolkien  \n",
       "1          Novel  J. R. R. Tolkien  \n",
       "2          Novel  J. R. R. Tolkien  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_metadata = df_clean[[\"franchise\", \"revenue_category\", \"original_media\", \"creators\"]]\n",
    "df_metadata.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### Final Dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>franchise</th>\n",
       "      <th>revenue_category</th>\n",
       "      <th>year_created</th>\n",
       "      <th>total_revenue</th>\n",
       "      <th>revenue</th>\n",
       "      <th>original_media</th>\n",
       "      <th>creators</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Avengers</td>\n",
       "      <td>Box Office</td>\n",
       "      <td>1963</td>\n",
       "      <td>15.0</td>\n",
       "      <td>7.765</td>\n",
       "      <td>Comic book</td>\n",
       "      <td>Stan Lee Jack Kirby</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Avengers</td>\n",
       "      <td>Comic or Manga</td>\n",
       "      <td>1963</td>\n",
       "      <td>15.0</td>\n",
       "      <td>0.131</td>\n",
       "      <td>Comic book</td>\n",
       "      <td>Stan Lee Jack Kirby</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Avengers</td>\n",
       "      <td>Home Video/Entertainment</td>\n",
       "      <td>1963</td>\n",
       "      <td>15.0</td>\n",
       "      <td>0.459</td>\n",
       "      <td>Comic book</td>\n",
       "      <td>Stan Lee Jack Kirby</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  franchise          revenue_category  year_created  total_revenue  revenue  \\\n",
       "0  Avengers                Box Office          1963           15.0    7.765   \n",
       "1  Avengers            Comic or Manga          1963           15.0    0.131   \n",
       "2  Avengers  Home Video/Entertainment          1963           15.0    0.459   \n",
       "\n",
       "  original_media             creators  \n",
       "0     Comic book  Stan Lee Jack Kirby  \n",
       "1     Comic book  Stan Lee Jack Kirby  \n",
       "2     Comic book  Stan Lee Jack Kirby  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Generate final dataframe\n",
    "df_final = (\n",
    "    pd.merge(df_sum, df_metadata, how=\"left\", on=[\"franchise\", \"revenue_category\"])\n",
    "    .drop_duplicates(keep=\"first\")\n",
    "    .reset_index(drop=True)\n",
    ")\n",
    "df_final.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "PyJanitor development",
   "language": "python",
   "name": "pyjanitor-dev"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
